npj Digital Medicine

85 training papers 2019-06-25 – 2026-03-07

Top medRxiv preprints most likely to be published in this journal, ranked by match strength.

1
Evaluating the AI Potential as a Safety Net for Diagnosis: A Novel Benchmark of Large Language Models in Correcting Diagnostic Errors
2026-02-24 health systems and quality improvement 10.64898/2026.02.22.26346832
#1 (25.6%)

Background: Diagnostic errors are a leading cause of preventable patient harm, often occurring during early clinical encounters where diagnostic uncertainty is maximal. Large language models (LLMs) have shown potential in medical reasoning, yet their ability to function as a diagnostic safety net, specifically by identifying and correcting human diagnostic errors, remains systematically unquantified. We evaluated whether state-of-the-art LLMs can effectively challenge, rather than merely confirm, ...

2
PaiX Net: A Next-Generation Second-Opinion Platform for Pathology
2026-02-09 pathology 10.64898/2026.02.04.26345344
#1 (24.7%)

Pathology faces persistent challenges including a global shortage of specialists, uneven access to expertise, increasing diagnostic complexity, and a growing need for second-opinion consultations. While digital and telepathology platforms address parts of this problem, existing solutions often trade accessibility for structured, workflow-aware clinical integration. At the same time, multimodal medical AI shows promise for diagnostic support but raises concerns regarding transparency, automation ...

3
MedOS: AI-XR-Cobot World Model for Clinical Perception and Action
2026-02-23 health informatics 10.64898/2026.02.18.26345936
#1 (24.1%)

Medicine historically separates abstract clinical reasoning from physical intervention. We bridge this divide with MedOS, a general-purpose embodied world model. Mimicking human cognition via a dual-system architecture, MedOS demonstrates superior reasoning on biomedical benchmarks and autonomously executes complex clinical research. To extend this intelligence physically, the system simulates medical procedures as a physics-aware model to foresee adverse events. Generating and validating on the...

4
Red-Teaming Medical AI: Systematic Adversarial Evaluation of LLM Safety Guardrails in Clinical Contexts
2026-03-05 health informatics 10.64898/2026.02.26.26347212
#1 (24.1%)

Background: Large language models (LLMs) are increasingly deployed in medical contexts as patient-facing assistants, providing medication information, symptom triage, and health guidance. Understanding their robustness to adversarial inputs is critical for patient safety, as even a single safety failure can lead to adverse outcomes including severe harm or death. Objective: To systematically evaluate the safety guardrails of state-of-the-art LLMs through adversarial red-teaming specifically designe...

5
Synergistic barriers to algorithmic recourse in healthcare and administrative systems
2026-02-26 health systems and quality improvement 10.64898/2026.02.22.26346836
#1 (24.0%)

Algorithmic decision systems mediate access to healthcare, credit, employment and housing, yet individuals who experience adverse decisions face multi-stage barriers when seeking recourse. We formalize these barriers as a series-structured system with 11 empirically parameterized stages across three layers (data integration, data accuracy and institutional access) and prove that single-barrier interventions are bounded by baseline system success. Under baseline parameterization derived from fede...

6
ED-Triage-Agent: A Framework for Human-AI Collaborative Emergency Triage
2026-02-18 health informatics 10.64898/2026.02.17.26346501
#1 (23.9%)

Emergency Department triage is a critical decision-making process in which clinicians must rapidly assess patient acuity under high cognitive load and time pressure. We present ED-Triage-Agent (ETA), a multi-agent AI framework designed to augment clinical decision-making in Emergency Severity Index (ESI) classification through human-AI collaboration. The system operates in two phases: (1) autonomous patient intake via a conversational agent that collects structured sympto...

7
Population differences in wearable device wear time: Rescuing data to address biases and advance health equity
2026-03-06 health informatics 10.64898/2026.03.06.26347799
#1 (23.6%)

Wearable devices present transformative opportunities for personalized healthcare through continuous monitoring of digital biomarkers; however, individual variations in device wear time could mask or otherwise impact signal identification. Despite the widespread adoption of wearable devices in research, no comprehensive framework exists for understanding how wear time varies across populations or for addressing wear time-related biases in analysis. Using Fitbit data from 11,901 participants in t...

8
Representation Before Retrieval: Structured Patient Artifacts Reduce Hallucination in Clinical AI Systems
2026-02-16 health informatics 10.64898/2026.02.13.26346256
#1 (22.6%)

Background: Large language models show promise for clinical decision support, yet their propensity for hallucination (generating plausible but unsupported claims) poses substantial patient safety risks. Retrieval-augmented generation (RAG) is widely assumed to mitigate this problem by grounding outputs in retrieved documents, but this assumption remains inadequately tested in clinical contexts where information density, temporal complexity, and safety stakes are uniquely high. Methods: We develope...

9
Multi-Model Clinical Validation of an AI-Powered Biomarker Analysis Framework: A Cross-Vendor Benchmark on 4,018 NHANES Patients
2026-02-17 health informatics 10.64898/2026.02.13.26346284
Top 0.1% (22.4%)

Background: Large language models (LLMs) show promise for clinical decision support, yet most validation studies evaluate single models, leaving questions about generalizability and vendor dependence unanswered. We assessed whether a standardized biomarker analysis framework maintains clinical-grade accuracy across multiple LLMs from independent providers. Methods: We developed a structured prompt-based framework for detecting eight clinical patterns (insulin resistance, diabetes, cardiovascular di...

10
Development and retrospective validation of SCOUT: scalable clinical oversight of large language models via uncertainty triangulation
2026-02-10 cardiovascular medicine 10.64898/2026.02.08.26345860
Top 0.1% (22.3%)

Large language models (LLMs) are increasingly used in clinical workflows, yet requiring clinician review of every AI output negates the efficiency gains that motivate their adoption. We present SCOUT (Scalable Clinical Oversight via Uncertainty Triangulation), a model-agnostic meta-verification framework that selectively defers unreliable LLM predictions to clinicians by triangulating three orthogonal signals: model heterogeneity, stochastic inconsistency, and reasoning critique. In this retrosp...

11
Personalized Insights Derived from Wearable Device Data and Large Language Models to Improve Well-Being
2026-03-04 health informatics 10.64898/2026.03.03.26347299
Top 0.1% (22.1%)

Health behaviors such as physical activity and sleep affect mental health, but the effect of each health behavior varies substantially across individuals, limiting the usefulness of generic behavioral recommendations. We collected one year of continuous wearable and ecological momentary assessment data from 3,139 participants in the Intern Health Study (2018-2023), and examined individual-level associations between wearable-derived features and mood across the internship year. The behaviors asso...

12
Artificial Intelligence in Healthcare: 2025 Year in Review
2026-02-28 health informatics 10.64898/2026.02.23.26346888
Top 0.1% (21.9%)

Background: Breakthroughs in model architecture and the availability of data are driving transformational artificial intelligence in healthcare research at an exponential rate. The shift in use of model types can be attributed to multimodal properties of the Foundation Models, better reflecting the inherently diverse nature of clinical data and the advancing model implementation capabilities. Overall, the field is maturing from exploratory development towards application in real-world evaluation a...

13
Handling onset age inconsistencies in longitudinal healthcare survey data
2026-02-23 health informatics 10.64898/2026.02.20.26346741
Top 0.2% (18.1%)

Longitudinal healthcare surveys frequently contain inconsistencies in self-reported onset ages, where participants report different ages for the same condition between enrollment and follow-up surveys. We propose two methods to handle this challenge. First, we introduce a procedure that aggregates inconsistency patterns to construct participant-level reliability scores, enabling researchers to stratify participants and prioritize analysis on high-reliability cohorts. Seco...

14
GEN-KnowRD: Reframing AI for Rare Disease Recognition
2026-03-03 health informatics 10.64898/2026.03.02.26347469
Top 0.3% (17.7%)

Rare diseases affect over 300 million people worldwide, yet patients often endure years-long diagnostic delays that limit timely intervention and trial opportunities. Computational rare disease recognition (RDR) remains constrained by knowledge resources that are often incomplete, heterogeneous, and dependent on extensive multi-disciplinary expert curation that cannot scale. Large language models (LLMs) applied directly for end-to-end diagnosis or disease discrimination face similar knowledge bo...

15
How Agent Role Structure Alters Operating Characteristics of Large Language Model Clinical Classifiers: A Comparative Study of Specialist and Deliberative Multi-Agent Protocols
2026-02-24 health informatics 10.64898/2026.02.22.26346818
Top 0.3% (17.6%)

Large language models (LLMs) are increasingly deployed in structured clinical decision support, yet the architectural effects of internal role decomposition within multi-agent systems remain poorly isolated. Prior comparisons of single-agent and multi-agent prompting frequently confound workflow structure with changes in model configuration, training, or decoding. We present a controlled architectural study of role-structured inference under fixed model parameters, isolating internal role decomp...

16
A deterministic safety pipeline for therapeutic AI in elderly assisted living
2026-02-18 health informatics 10.64898/2026.02.17.26346507
Top 0.3% (17.4%)

Over 54 million Americans are aged 65+, with depression affecting 25-49% and anxiety exceeding 30% of assisted living residents. AI systems employing agentic orchestration exhibit 0.5-2% failure rates, which is unacceptable where a single missed crisis can be fatal. We designed and bench-evaluated Lilo Engine, a 5-layer deterministic therapeutic pipeline replacing a prior multi-agent orchestrator. Safety is enforced through structural invariants: a Guardian layer with 4-gate OR crisis detection runs unco...

17
OCR-Mediated Modality Dominance in Vision-Language Models: Implications for Radiology AI Trustworthiness
2026-02-24 health informatics 10.64898/2026.02.22.26346828
Top 0.3% (17.4%)

Background: Vision-language models (VLMs) are increasingly proposed for radiologic decision support, yet the security implications of deploying general-domain, OCR-capable models in diagnostic workflows remain poorly characterized. When image-embedded text is not treated as untrusted input, the visual channel becomes vulnerable to adversarial manipulation through OCR-readable overlays. Methods: Nine commercial VLMs, none intended or validated for clinical diagnosis, were evaluated on 600 brain MR...

18
On the robustness of medical term representations in locally deployable language models
2026-02-26 health informatics 10.64898/2026.02.24.26346972
Top 0.4% (17.1%)

Background: Hosting large language models (LLMs) on-premises can secure patient data but requires compact architectures to function on standard hardware. The impact of such constraints on the robustness of their representations for medical terminology is important for clinical AI safety but poorly understood. The statistical nature of LLM training inherently limits the representation of terms with low societal prominence or lexical frequency, and high ambiguity. ...

19
Model Development and Real-World Deployment of Multimodal Input-Based Subtyping of Depression in Tele-Counseling for Scalable Mental Health Assessment
2026-02-18 psychiatry and clinical psychology 10.64898/2026.02.11.25342657
Top 0.4% (17.1%)

The rapid growth of tele-counseling and the use of lay counselors in high-volume, low-resource mental health services has created a need for scalable tools for early detection and triage. Effective personalization now requires stratifying individuals by dominant symptom profiles, such as appetite, agency, anxiety, and sleep disturbances. Depression symptoms vary widely, even among those with similar scores, reflecting distinct psychophysiological and cognitive-affective patterns. In tele-mental-...

20
SydneyMTL: Interpretable Multi-Task Learning for Complete Sydney System Assessment in Gastric Biopsies
2026-02-18 pathology 10.64898/2026.02.17.26346304
Top 0.5% (14.9%)

The Updated Sydney System (USS) provides a standardized framework for grading gastritis and stratifying gastric cancer risk. However, subjective observer variability and labor-intensive workflows impede its routine clinical use. To address these challenges, we developed SydneyMTL, a multi-task deep learning framework that uses Multiple Instance Learning (MIL) with task-specific attention pooling to predict severity grades across all five USS attributes simultaneously. Trained on an unprecedented...